Wilf's Programmers' Workshop
Continued from the magazine...
Suppose you have a group of records in memory in sorted sequence.
If you change the value of one of the items, you may well destroy the
sorted sequence, and modifications will be needed to bring it back into
order. What steps are needed? How many comparisons and location swaps are
needed? How much extra memory is required for storing vectors?
These are all crucial questions, and depending on the sorting method you use,
by no means trivial. Even with the highly efficient quick sort (using binary trees)
you can run into some very sticky business. Changing a value so that it affects
sorted sequence means cutting the changed item from its current location in the tree
(which involves rewiring the vectors of the items immediately below it in the tree)
and then making a series of comparisons to find its new logical place.
Unfortunately, even simply adding an item into a quick sort tree can take considerable time
if, for example, the tree is skewed, and that can happen when the input occurs in roughly
sorted or reverse order: a quick sort works best when data starts out in random order.
A heap is different: the number of comparisons and swaps needed to change or add an item
varies only a little. Even with a thousand items in the heap, only ten comparisons
(at most) need to be done. Can this fact be exploited? Can a heap help us speed up
a sort that we suspect will be slow?
How a heap sort works
A heap sort is exactly what it sounds like: a heap construction, followed by a sort.
Let's take an example. We have a list of items we want in sorted sequence, smallest to
greatest. Suppose we use the following numbers, input as a stream in the following
unsorted sequence: 20, 111, 184, 226, 204, 79, 125. Using the HeapAdd procedure with
the greatest value coming to the top, we will first of all create a heap.
[Image: wpw142xa.gif]
Using the HeapAdd procedure repeatedly (which calls HeapSettle) we first
build a top-heavy heap of all the items to be sorted.
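The full routines are in PROGCODE.TXT on the SuperCD; purely as an illustration of the idea
(this sketch is mine, not the listing from the CD), an add routine for a top-heavy heap might
look something like this, assuming the heap lives in a two-dimensional array with the key in
column 1, the vector in column 2, and the current size held in item (0, 1):
' Sketch only: assumes Heap(0 To 1023, 1 To 2) As Long, with the key in
' column 1, the vector in column 2, and the heap's current size in Heap(0, 1).
Sub HeapAdd(Heap() As Long, ByVal Key As Long, ByVal Vector As Long)
    Dim Child As Long, Parent As Long, Temp As Long
    Heap(0, 1) = Heap(0, 1) + 1          ' grow the heap by one item
    Child = Heap(0, 1)
    Heap(Child, 1) = Key                 ' the new item starts at the bottom
    Heap(Child, 2) = Vector
    Do While Child > 1                   ' sift upward: one comparison per level
        Parent = Child \ 2
        If Heap(Child, 1) <= Heap(Parent, 1) Then Exit Do
        Temp = Heap(Child, 1): Heap(Child, 1) = Heap(Parent, 1): Heap(Parent, 1) = Temp
        Temp = Heap(Child, 2): Heap(Child, 2) = Heap(Parent, 2): Heap(Parent, 2) = Temp
        Child = Parent
    Loop
End Sub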
The finished heap only provides immediate access to the largest number within it, which we know
will always rise to the top (item 1). We have made the heap; the second half of the
task is to turn it into a sorted array. This is easier than you may think.
We take three further steps, and repeat them until the heap is 'empty':
1: Swap the first and last items in the heap; this puts the highest value at the
end of the heap.
2: Reduce the size of the heap by one item, effectively leaving off the recently
moved highest item.
3: Settle the new first item within the heap using the HeapSettle procedure.
One by one we move the items from back to front, consuming the heap as we go; a sketch of the loop appears below.
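Sketched in Visual Basic (again my own illustration rather than the SuperCD listing, and
assuming the same array layout as before plus a HeapSettle routine of the kind described
in the next section), the loop might look like this:
' Sketch only: swap top and bottom, shorten the heap, settle the new top,
' and repeat until only one item is left. The result is an ascending array.
Sub HeapToSortedArray(Heap() As Long)
    Dim Last As Long, Temp As Long
    Do While Heap(0, 1) > 1
        Last = Heap(0, 1)
        ' Step 1: swap the first and last items in the heap
        Temp = Heap(1, 1): Heap(1, 1) = Heap(Last, 1): Heap(Last, 1) = Temp
        Temp = Heap(1, 2): Heap(1, 2) = Heap(Last, 2): Heap(Last, 2) = Temp
        ' Step 2: shorten the heap, leaving the largest item just beyond its end
        Heap(0, 1) = Heap(0, 1) - 1
        ' Step 3: settle the new first item back down to its proper level
        HeapSettle Heap
    Loop
End Sub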
Calculating the advantage
Now think about how many operations this will take. Adding each item into the heap
will take an average of (log2n)/2 operations, comparing and swapping, where n is the
number of items in the heap. For example, log2 of 128 is seven (because 2 to the power of 7 is 128), and that
tells us how many levels there are.
A large item may have to be swapped upward through all seven levels, but the average
will be about half that. Of course the size of the heap increases all the time, and so
log2n increases with time. However if we take the maximum value of n we will be fairly
safe, because that value of log2n will be accurate for half the 'life' of the heap. Our
estimate will be a little high, but in the right ball park.
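As a rough check of that estimate (my own arithmetic, not from the listing): with 128 items
the heap is seven levels deep, so building the whole heap costs in the region of 128 x 7/2,
or about 450 comparison-and-swap operations.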
The second part of the process (reducing the heap while building the sorted array in
reverse order) is a little more cumbersome, because HeapSettle will be used to move
item 1 downward in the heap. This process requires two comparisons for every level
instead of one. But again, each swap/reduce/settle cycle will take an average of
(log2n)/2 operations.
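A settle routine along those lines (once more only a sketch of the idea under the same
array assumptions, not the CD listing) first compares the two children to find the larger,
then compares that child with the parent, which is where the two comparisons per level come from:
' Sketch only: sift item 1 downward until both of its children are smaller.
Sub HeapSettle(Heap() As Long)
    Dim Parent As Long, Child As Long, Size As Long, Temp As Long
    Size = Heap(0, 1)
    Parent = 1
    Do While Parent * 2 <= Size
        Child = Parent * 2                        ' left child
        If Child < Size Then                      ' comparison 1: pick the larger child
            If Heap(Child + 1, 1) > Heap(Child, 1) Then Child = Child + 1
        End If
        If Heap(Child, 1) <= Heap(Parent, 1) Then Exit Do   ' comparison 2: child v parent
        Temp = Heap(Parent, 1): Heap(Parent, 1) = Heap(Child, 1): Heap(Child, 1) = Temp
        Temp = Heap(Parent, 2): Heap(Parent, 2) = Heap(Child, 2): Heap(Child, 2) = Temp
        Parent = Child
    Loop
End Sub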
Taking these two halves together, we can see that a heap sort of n items takes about
log2n operations per item, each consisting of 1.5 comparisons (the average of one for going
upward, two for going downward) and one swap. There is not much difference whether
the input starts in sorted, reverse-sorted, or random order.
How does this compare with the quick sort, which we have used often in the past? The
average number of comparisons you have to make (from the root downwards) to find the
proper sorted location for the nth item is log2n. Even ignoring the time to do a swap
of two items, the heap sort takes 50 per cent more time doing comparisons for each
operation, and since the number of operations is the same on average, this makes
the quick sort a clear winner.
But the story doesn't end there. Suppose the items occur in sorted sequence already,
or close to sorted sequence. In this unfortunate case the number of layers is no
longer log2n, but a lot closer to n itself, with the tree looking more like a
long chain. With 100 items this would mean that a quick sort would take more than
nine times as long as a heap sort. With 1000 items the heap sort can outstrip the
quick sort by a factor of 60.
What can we say from all this? A heap sort is an excellent means for sorting a large
number of items which will be presented already well-sorted (or in 'clumps' that would
make quick sort slower). This also goes for items that will be presented in more or
less the opposite of sorted order. Note that a heap sort works fine with randomly
presented data as well, but just not as fast as a quick sort would.
[Image: wpw142xb.gif]
At the start of the first cycle of a heap sort, the first and last items
in the heap are swapped, and the heap shortened.
[Image: wpw142xc.gif]
The end of the first cycle sees the heap adjusted with a new 'top' item
with the highest value.
[Image: wpw142xd.gif]
The second cycle proceeds exactly as the first cycle does: the first and
last items are swapped and the heap shortened.
[Image: wpw142xe.gif]
The second cycle ends with a smaller heap, and two sorted items immediately
following the heap's end.
[Image: wpw142xf.gif]
After several cycles the heap is gone, and in its place all the items appear
in sequential order.
Here on the SuperCD are simple Visual Basic routines for heap maintenance and doing a
heap sort within a simple two-dimensional array. The first item in each entry of the
array is the sequence (or sort) key: though we have been dealing with numbers, alphabetic
sequences work just as well. The second item in each entry is a vector: it can point to
the full record, or perhaps it is a merge file number detailing from which file the
record was read, and therefore which file has to be read again.
I like the 'trick' of putting the current size of the heap in the sequence key of the
zero item of the heap array. You may instead like to have it as a global variable, or
a passed variable. Using the zero item for passing vital information about the array
itself strikes me as a simple and elegant way of incorporating a bit of object
orientation: the array becomes the heap object itself, and item 0,1 becomes its
size attribute.
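To make the layout concrete, a small driver along the following lines (hypothetical, not part
of PROGCODE.TXT) declares such an array, treats item (0, 1) as the size attribute, feeds in
the example numbers from earlier, and sorts them with the routines sketched above:
' Sketch only: build a heap from the example stream and consume it into sorted order.
Sub HeapDemo()
    Dim Heap(0 To 100, 1 To 2) As Long
    Dim Values As Variant, i As Long, Count As Long
    Heap(0, 1) = 0                        ' item (0, 1) acts as the heap's size attribute
    Values = Array(20, 111, 184, 226, 204, 79, 125)
    For i = LBound(Values) To UBound(Values)
        HeapAdd Heap, CLng(Values(i)), i  ' use the input position as the vector
    Next i
    Count = Heap(0, 1)                    ' remember how many items went in
    HeapToSortedArray Heap                ' the heap is consumed as the array is sorted
    For i = 1 To Count                    ' keys come out as 20, 79, 111, 125, 184, 204, 226
        Debug.Print Heap(i, 1); Heap(i, 2)
    Next i
End Sub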
We will be using heaps in various projects in the future; they are extremely useful,
and surprisingly easy to manage.
Visual Basic program code examples
Examples of SUBs that will maintain a heap, and the method of using them to perform a
heap sort, are demonstrated in PROGCODE.TXT. Feel free to adapt and improve them for
your own purposes, and convert them into other computer languages.